13 May, 2020

Introduction

South Korea during COVID-19

South Korea is one of the world’s most densely populated countries, with a population of 51.64 million.

The first case of COVID-19 in this country was confirmed on the 20th of January 2020. Since then, there have been only 256 deaths caused by COVID-19.

Research questions

  • How has the epidemic evolved in South Korea?

  • Is there any correlation between the place of infection and severity of the disease?

  • Does gender or age predispose people to getting the disease or to a more severe outcome?

  • Are there any characteristic features in the high-prevalence disease areas?

  • Can confirmation of the disease be predicted based on the city?

Materials and methods

Workflow and Structure of the project

Reproducibility

  • The project includes all steps of the data analysis
  • The aim is to achieve consistent computational results

Data cleaning

  • Remove invalid data (NAs)

  • Remove unnecessary columns

  • Convert data into a tidy format following the tidy data rules

    • Each variable has its own column

    • Each observation has its own row

    • Each value has its own cell
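The cleaning steps above can be sketched as follows. This is a minimal illustration, not the project's actual script: the table `raw` and its columns (`date`, `age_20s`, `age_30s`, `source_id`) are hypothetical, and the dplyr and tidyr packages are assumed to be installed.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per date, one column per age group
raw <- tibble(
  date      = as.Date(c("2020-03-01", "2020-03-02", "2020-03-03")),
  age_20s   = c(10, 12, NA),
  age_30s   = c(5, 7, 9),
  source_id = c("a", "b", "c")  # assumed not to be needed for the analysis
)

clean <- raw %>%
  select(-source_id) %>%        # remove unnecessary columns
  drop_na() %>%                 # remove rows that contain NA
  pivot_longer(                 # tidy format: one row per (date, age group)
    cols         = starts_with("age_"),
    names_to     = "age_group",
    names_prefix = "age_",
    values_to    = "confirmed"
  )
```

After these steps each variable (`date`, `age_group`, `confirmed`) has its own column and each observation has its own row.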

Data augmenting

  • Joining dataset tables using full_join

  • Subsetting data

  • Combining columns using unite

  • Creating new variables for the analysis
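A minimal sketch of the augmenting steps listed above, assuming dplyr and tidyr are installed. The two small tables and their columns (`patient_id`, `province`, `city`, `birth_year`, `type`) are hypothetical stand-ins for the patient tables, and the derived `age` variable is only an example of a new variable.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-ins for two dataset tables
patient_info <- tibble(
  patient_id = c(1, 2, 3),
  province   = c("Seoul", "Busan", "Seoul"),
  city       = c("Gangnam-gu", "Dongnae-gu", "Jongno-gu"),
  birth_year = c(1964, 1987, 1991)
)
patient_route <- tibble(
  patient_id = c(1, 1, 4),
  type       = c("hospital", "store", "airport")
)

patient <- patient_info %>%
  # full_join keeps unmatched rows from both tables (filled with NA)
  full_join(patient_route, by = "patient_id") %>%
  # unite pastes two columns into one; remove = FALSE keeps the originals
  unite("location", province, city, sep = "_", remove = FALSE) %>%
  # new variable for the analysis (illustrative: age in 2020)
  mutate(age = 2020 - birth_year)
```

Note that `full_join` introduces NAs for patients present in only one table, so the order of joining and NA removal matters.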

Final datasets

  • Case data (Case)

  • Patient data (Patient info + Patient route)

  • Time data (Time + Time age + Time gender + Time province + SearchTrend)

  • City data (region + Patient info)

Results

Research question

Research question

score_org   score_pca
42.5%       49.6%

Research question

ANN Network
accuracy
34.5%

Shiny app

Conclusion and discussion

  • The number of confirmed cases is high compared to the number of deaths.

  • One peak (until the beginning of April); the epidemic follows a logistic model.

  • There’s no correlation between the place of infection and severity of the disease

  • More men die, but more women are confirmed to be sick. Young people are driving the spread.

  • At least from the retrieved data, there is no strong difference.

  • People in their 70s and 80s have a higher fatality rate (as expected).

  • There are clusters of superspreaders, and a certain age range can be observed in each.

  • A higher disease prevalence can be observed in bigger cities and in those that have nursing homes. This is very different from the countryside, where a high elderly population ratio and a high ratio of elderly people living alone go along with fewer cases.

  • Accuracy is just above 50%, which is better than random guessing with 4 classes.

  • Similar performance to k-means.
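To illustrate the point about a single peak and a logistic model, here is a minimal base R sketch of a logistic growth curve. The function name `logistic` and the parameter values (`K`, `r`, `t0`) are hypothetical and chosen only for illustration; they are not fitted to the South Korean data.

```r
# Logistic growth: K = final number of cases, r = growth rate,
# t0 = day of the inflection point (all values below are hypothetical)
logistic <- function(t, K, r, t0) K / (1 + exp(-r * (t - t0)))

days       <- 0:80
cumulative <- logistic(days, K = 10000, r = 0.2, t0 = 40)

# Daily new cases are the day-to-day increments of the cumulative curve
daily_new <- diff(cumulative)
```

Under this model the cumulative curve flattens towards `K`, and the daily increments rise until around day `t0` and fall afterwards, which is why a logistic fit corresponds to a single peak in new cases.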

Problems and solutions

Using different packages will mask some functions.

  • Detach packages after each R script.
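One way to detach packages at the end of a script is sketched below in base R. The helper name `detach_all_packages` is hypothetical and this is not necessarily how the project does it. It detaches every attached package except the ones R attaches by default.

```r
# Hypothetical helper: detach all attached packages except R's defaults
detach_all_packages <- function() {
  base_pkgs <- c("stats", "graphics", "grDevices", "utils",
                 "datasets", "methods", "base")
  attached  <- grep("^package:", search(), value = TRUE)
  to_detach <- setdiff(attached, paste0("package:", base_pkgs))
  for (pkg in to_detach) {
    detach(pkg, character.only = TRUE)
  }
  invisible(to_detach)  # names of the packages that were detached
}

library(splines)        # example: attach a package for one script
detach_all_packages()   # call at the end of the script
```

An alternative that avoids masking altogether is to call functions with an explicit namespace, for example `dplyr::filter()` versus `stats::filter()`.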

Superspreaders

Correlation matrix

PCA variance explained

Regional cases plot